预训练的语言模型(PLM)在自然语言理解中的许多下游任务中取得了显着的性能增长。已提出了各种中文PLM,以学习更好的中文表示。但是,大多数当前模型都使用中文字符作为输入,并且无法编码中文单词中包含的语义信息。虽然最近的预训练模型同时融合了单词和字符,但它们通常会遭受不足的语义互动,并且无法捕获单词和字符之间的语义关系。为了解决上述问题,我们提出了一个简单而有效的PLM小扣手,该小扣子采用了对单词和性格表示的对比度学习。特别是,Clower通过对多透明信息的对比学习将粗粒的信息(即单词)隐式编码为细粒度表示(即字符)。在现实的情况下,小电动器具有很大的价值,因为它可以轻松地将其纳入任何现有的基于细粒的PLM中而无需修改生产管道。在一系列下游任务上进行的扩展实验表明,小动物的卓越性能超过了几个最先进的实验 - 艺术基线。
translated by 谷歌翻译
In this report, we focus on reconstructing clothed humans in the canonical space given multiple views and poses of a human as the input. To achieve this, we utilize the geometric prior of the SMPLX model in the canonical space to learn the implicit representation for geometry reconstruction. Based on the observation that the topology between the posed mesh and the mesh in the canonical space are consistent, we propose to learn latent codes on the posed mesh by leveraging multiple input images and then assign the latent codes to the mesh in the canonical space. Specifically, we first leverage normal and geometry networks to extract the feature vector for each vertex on the SMPLX mesh. Normal maps are adopted for better generalization to unseen images compared to 2D images. Then, features for each vertex on the posed mesh from multiple images are integrated by MLPs. The integrated features acting as the latent code are anchored to the SMPLX mesh in the canonical space. Finally, latent code for each 3D point is extracted and utilized to calculate the SDF. Our work for reconstructing the human shape on canonical pose achieves 3rd performance on WCPA MVP-Human Body Challenge.
translated by 谷歌翻译
我们提出了联合隐式功能(UNIF),这是一种基于原始扫描和骨骼作为输入的人类重建和动画的零件方法。先前的基于部分的人重建方法依赖于SMPL的地面零件标签,因此仅限于最小衣服。相比之下,我们的方法学会了将部分与身体运动分开,而不是部分监督,因此可以扩展到穿衣服的人类和其他铰接的物体。我们的分区从动作进行分区是通过以骨骼为中心的初始化,骨限度损失和正常损失来实现的,即使训练姿势受到限制,也可以确保稳定的零件分裂。我们还为SDF提供了最小的周边损失,以抑制额外的表面和部分重叠。我们方法的另一个核心是一种相邻的部分接缝算法,该算法会产生非刚性变形,以维持显着缓解基于部分伪像的部分之间的连接。在该算法下,我们进一步提出了“竞争部分”,该方法通过点对骨骼而不是绝对位置的相对位置定义了重量,从而避免了神经隐式函数的概括性问题(线性混合皮肤)。我们通过在CAPE和ClothSeq数据集上穿衣服的人体重建和动画来证明我们方法的有效性。
translated by 谷歌翻译
直到最近,通过深度加强学习(DRL)已经实现了强大的机器人机器人。然而,为了高效地学习参数化的BipeDal行走,通常需要开发的参考资料,将性能限制为参考文献的性能。在本文中,我们建议设计用于从参考文献的模仿学习的自适应奖励功能。鼓励代理人模仿当其性能较低时的参考资料,同时在达到参考资料的限制时追求高性能。我们进一步证明,开发的参考资料可以通过在没有费力调整的情况下产生的低质量参考,并且只要他们能够提供先验的知识即可加快学习过程即可才能部署。
translated by 谷歌翻译
在这项工作中,我们专注于互动人类解析(IHP),旨在将人体形象分成多个人体部位,具有来自用户的相互作用的指导。这项新任务继承了人类解析的类感知属性,其无法通过通常是禁止类别的传统交互式图像分割方法很好地解决。为了解决这项新任务,我们首先利用用户点击以识别给定图像中的不同人为部分。随后将这些点击转换为语义感知的本地化映射,其与RGB图像连接以形成分割网络的输入并生成初始解析结果。为了使网络能够更好地了解用户在校正过程中的目的,我们调查了改进的几个主要方法,并揭示了基于随机采样的点击增强是推广校正效果的最佳方式。此外,我们还提出了一种语义感知损失(SP损失)来增加培训,这可以有效利用点击的语义关系以获得更好的优化。为了最好的知识,这项工作是第一次尝试在交互式设置下解决人类解析任务。我们的IHP解决方案在基准嘴唇上实现了85 \%Miou,Pascal-Person-Part和CiHP,75 \%Miou,只有1.95,3.02,2.84和每班3.09点击的Helen。这些结果表明,我们只需几个人类努力就可以获得高品质的人类解析面具。我们希望这项工作能够激励更多的研究人员在未来为IHP开发数据有效的解决方案。
translated by 谷歌翻译
Unsupervised domain adaptation (UDA) for semantic segmentation is a promising task freeing people from heavy annotation work. However, domain discrepancies in low-level image statistics and high-level contexts compromise the segmentation performance over the target domain. A key idea to tackle this problem is to perform both image-level and feature-level adaptation jointly. Unfortunately, there is a lack of such unified approaches for UDA tasks in the existing literature. This paper proposes a novel UDA pipeline for semantic segmentation that unifies image-level and feature-level adaptation. Concretely, for image-level domain shifts, we propose a global photometric alignment module and a global texture alignment module that align images in the source and target domains in terms of image-level properties. For feature-level domain shifts, we perform global manifold alignment by projecting pixel features from both domains onto the feature manifold of the source domain; and we further regularize category centers in the source domain through a category-oriented triplet loss and perform target domain consistency regularization over augmented target domain images. Experimental results demonstrate that our pipeline significantly outperforms previous methods. In the commonly tested GTA5$\rightarrow$Cityscapes task, our proposed method using Deeplab V3+ as the backbone surpasses previous SOTA by 8%, achieving 58.2% in mIoU.
translated by 谷歌翻译
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
translated by 谷歌翻译
Witnessing the impressive achievements of pre-training techniques on large-scale data in the field of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit, and mitigate the sample inefficiency problem for visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive irrelevant information for decision making, resulting in predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for the policy pretraining in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving policy related representations and thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios have demonstrated the superiority of our proposed approach, where improvements range from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
translated by 谷歌翻译
Increasing research interests focus on sequential recommender systems, aiming to model dynamic sequence representation precisely. However, the most commonly used loss function in state-of-the-art sequential recommendation models has essential limitations. To name a few, Bayesian Personalized Ranking (BPR) loss suffers the vanishing gradient problem from numerous negative sampling and predictionbiases; Binary Cross-Entropy (BCE) loss subjects to negative sampling numbers, thereby it is likely to ignore valuable negative examples and reduce the training efficiency; Cross-Entropy (CE) loss only focuses on the last timestamp of the training sequence, which causes low utilization of sequence information and results in inferior user sequence representation. To avoid these limitations, in this paper, we propose to calculate Cumulative Cross-Entropy (CCE) loss over the sequence. CCE is simple and direct, which enjoys the virtues of painless deployment, no negative sampling, and effective and efficient training. We conduct extensive experiments on five benchmark datasets to demonstrate the effectiveness and efficiency of CCE. The results show that employing CCE loss on three state-of-the-art models GRU4Rec, SASRec, and S3-Rec can reach 125.63%, 69.90%, and 33.24% average improvement of full ranking NDCG@5, respectively. Using CCE, the performance curve of the models on the test data increases rapidly with the wall clock time, and is superior to that of other loss functions in almost the whole process of model training.
translated by 谷歌翻译
Recent advances in self-supervised learning (SSL) in computer vision are primarily comparative, whose goal is to preserve invariant and discriminative semantics in latent representations by comparing siamese image views. However, the preserved high-level semantics do not contain enough local information, which is vital in medical image analysis (e.g., image-based diagnosis and tumor segmentation). To mitigate the locality problem of comparative SSL, we propose to incorporate the task of pixel restoration for explicitly encoding more pixel-level information into high-level semantics. We also address the preservation of scale information, a powerful tool in aiding image understanding but has not drawn much attention in SSL. The resulting framework can be formulated as a multi-task optimization problem on the feature pyramid. Specifically, we conduct multi-scale pixel restoration and siamese feature comparison in the pyramid. In addition, we propose non-skip U-Net to build the feature pyramid and develop sub-crop to replace multi-crop in 3D medical imaging. The proposed unified SSL framework (PCRLv2) surpasses its self-supervised counterparts on various tasks, including brain tumor segmentation (BraTS 2018), chest pathology identification (ChestX-ray, CheXpert), pulmonary nodule detection (LUNA), and abdominal organ segmentation (LiTS), sometimes outperforming them by large margins with limited annotations.
translated by 谷歌翻译